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When analyzing the data that arises from exome or whole-genome sequencing studies, 
window-based tests, (i.e., tests that jointly analyze all genetic data in a small genomic 
region), are very popular. However, power is known to be quite low for finding associations 
with phenotypes using these tests, and therefore a variety of analytic strategies may be 
employed to potentially improve power. Using sequencing data of all of chromosome 3 
from an interim release of data on 2432 individuals from the UK10K project, we simulated 
phenotypes associated with rare genetic variation, and used the results to explore the 
window-based test power. We asked two specific questions: firstly, whether there could 
be substantial benefits associated with incorporating information from external annotation 
on the genetic variants, and secondly whether the false discovery rate (FDRs) would be 
a useful metric for assessing significance. Although, as expected, there are benefits to 
using additional information (such as annotation) when it is associated with causality, we 
confirmed the general pattern of low sensitivity and power for window-based tests. For 
our chosen example, even when power is high to detect some of the associations, many 
of the regions containing causal variants are not detectable, despite using lax significance 
thresholds and optimal analytic methods. Furthermore, our estimated FDR values tended 
to be much smaller than the true FDRs. Long-range correlations between variants — due 
to linkage disequilibrium — likely explain some of this bias. A more sophisticated approach 
to using the annotation information may improve power, however, many causal variants of 
realistic effect sizes may simply be undetectable, at least with this sample size. Perhaps 
annotation information could assist in distinguishing windows containing causal variants 
from windows that are merely correlated with causal variants. 



Keywords: rare genetic variants, SNV, false discovery rate, multiple testing, genomic annotation, whole genome 
sequencing, window-based tests, stratified false discovery rate 



INTRODUCTION 

In genome-wide association studies, stringent testing thresholds 
have been recommended (and required by editors) to control 
the rate of identification of single nucleotide polymorphisms 
(SNPs) that may be falsely associated with a disease or trait of 
interest (Risch and Merikangas, 1996; Dudbridge and Gusnanto, 
2008). The most commonly used threshold is 5 x 10~ 8 , which 
controls the probability of making at least one false positive 
conclusion [the family- wise error rate (FWER)] to 5%, assum- 
ing approximately one million independent tests. Several studies 
have estimated approximately the same threshold, but derived 
it from different arguments: Risch and Merikangas used an 
early estimate of the potential number of genes in the genome 
(Risch and Merikangas, 1996), Dudbridge and Gusnanto exam- 
ined the number of independent tests when performing infinitely 
dense genotyping of genetic polymorphisms (Dudbridge and 
Gusnanto, 2008), and empirical thresholds have been obtained 
using extensive permutation analyses (Li and Ji, 2005; Dudbridge 



and Gusnanto, 2008; Pe'er et al, 2008). This threshold assumes 
that all available common genetic polymorphisms are each tested 
against the disease, once. However, sequencing is now known 
to identify millions of genetic alterations that may be extremely 
rare — unique to one individual or only occurring in a small hand- 
ful of people. These new sequence variants (single nucleotide 
variants or SNVs) may not have been previously observed, even 
in large collections of individuals such as the 1000 genomes 
(Abecasis et al, 2012), and linkage disequilibrium between these 
rare alterations and nearby known markers is usually very small. 
Previous estimates of genome-wide significance thresholds have 
not considered this spectrum of allele frequencies in their calcu- 
lations (Xu et al., accepted), and as a result, the former standard 
significance threshold for controlling type 1 error rates may not 
be adequate (Browning and Thompson, 2012). 

Using univariate tests, power to detect associations between 
SNVs and disease can be extremely low for rare SNVs, even when 
power is excellent for common variants. Hence, a whole new suite 
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of test statistics has recently been developed to jointly analyze all 
of the genetic variation (in particular the rare genetic variation) 
in a chosen window or region of the genome ( Asimit and Zeggini, 
2010; Bansal et al., 201 1; Burkett and Greenwood, 2013). One very 
popular method is the sequence kernel association test (SKAT) 
(Wu et al, 2011), which is a score test derived from a random 
effects model where the effects of all variants in a window are 
assumed to follow a normal distribution. 

One challenge of these methods is that a single SNV may 
participate in several different tests resulting from a variety of 
different window choices, weighting strategies, or test statistics. 
Unlike SNP genotyping of common polymorphisms, where it is 
possible to estimate an upper bound on the number of SNPs to be 
tested, given known population-specific linkage disequilibrium 
patterns, the potential number of region-based tests performed 
has no known upper bound. Many different windows could be 
defined spanning the same genomic region (e.g., Brisbin et al., 
2012). Furthermore, the genetic variants to be jointly analyzed 
do not need to be physically adjacent, but could, for example, 
lie in genes in the same pathway. Although choosing an appro- 
priate threshold is already challenging when considering exome 
sequence data, the difficulties are exacerbated for whole-genome 
sequence data. In this latter context, window choices may be quite 
arbitrary. 

Genome-wide significance thresholds for region-based tests do 
need to be established; we believe empirically-derived thresholds 
are probably necessary, and have recently shown that they can 
be effectively estimated by predicting the genome-wide threshold 
from empirical estimates obtained on smaller genomic regions 
(Xu et al., accepted). We demonstrated, in a sample of 2432 indi- 
viduals from the UK10K consortium, that genome-wide thresh- 
olds for the SKAT test (Wu et al, 2011) in windows of 50 
rare variants (overlapping by 25 variants) are expected to be 
near 7e-08. 

Although it is certainly of interest to develop appropriate 
thresholds for deciding genome-wide significance for region- 
based tests, an alternative perspective on multiple testing cor- 
rections for sequence-based analyses may be worth exploring. 
Additional sources of repeated testing arise from the choice of 
which variants to prioritize and which test statistic to use. The 
range of different test statistics available can lead to very different 
results on the same set of variants; some statistics are most power- 
ful when a large proportion of variants in a window are associated 
with an increasing disease risk (acting in the same direction), 
whereas other statistics may be optimal for smaller proportions 
of causal variants (Lee et al., 2012). Various minor allele fre- 
quency (MAF) thresholds can also be applied to restrict analysis 
to only "rare" genetic variants (Price et al, 2010). Also, scores 
such as PolyPhen-2 (Polymorphism Phenotyping v2) (Adzhubei 
et al, 2010) and SIFT (Sorting Intolerant From Tolerant) (Ng and 
Henikoff, 2002), which predict the probable functional impact of 
amino-acid changes induced by SNVs within exons, have been 
incorporated into region-based rare-variant tests. These publicly 
available genomic annotations can be used to select subsets of 
SNVs that are more likely to be associated with disease, or to 
give some SNVs more weight (Price et al., 2010; Wu et al., 2011; 
Lopes et al., 2012). Annotation of the entire genome is rapidly 



improving and non-coding regions are also known to contain 
many functional elements (Maher, 2012). 

Therefore, it may not be sufficient to establish one signifi- 
cance threshold that will control the genome-wide FWER for 
region-based testing of rare genetic variation. An alternative strat- 
egy should be more open to exploiting external knowledge when 
deciding which associations are interesting, and these alterna- 
tive approaches may lead to better power to detect true causal 
associations. 

For a given choice of test statistic, windows, and weighting or 
prioritization of variants, permutation analysis of all the data will, 
of course, lead to an appropriate empirical family-wise signifi- 
cance threshold for any desired type 1 error rate. However, this is 
likely to require very large amounts of computation for genome- 
wide sequence data. Another possible approach to interpreting 
the results of multiple tests is to consider the false discovery rate 
(FDR) (Benjamini and Hochberg, 1995) instead of the FWER for 
choosing a significance threshold. The FWER is the probability 
of making at least one false rejection of the null hypothesis. For 
m independent tests, and if the probability of type 1 error for 
each test is chosen to be a, then the FWER rate can be written 
as FWER = 1 — (1 — a) m . Therefore, as the number of tests, m, 
increases, the significance threshold needed at each test, a, must 
get smaller in order to control FWER. In contrast, the FDR is the 
proportion of all rejected tests that are truly null — instead of con- 
trolling the probability of at least one false positive test, the FDR 
attempts to control the proportion of false positive tests which is 
a less stringent criterion. 

FDR has several potential advantages over p-value measures 
of significance. Firstly, by controlling the (estimated) proportion 
of false positive associations among the total number of signif- 
icant tests, it is usually possible to identify more true positive 
associations overall. This feature can be extremely beneficial to 
sensitivity and power when there is an apriori expectation of 
many true associations. Secondly, if the number of tests per- 
formed is increasing due to a range of choices for windows, test 
statistics, or variant selection in the same genomic regions, such 
tests can be expected to be positively correlated. An upper bound 
on the FDR has been demonstrated for situations with positive 
dependence (Benjamini and Yekutieli, 2001), such as would be 
expected in this situation. Furthermore, empirical estimates of 
FDR have been previously shown to be quite tolerant of corre- 
lations (Efron, 2007); in that paper by Efron, the estimated FDR 
rates were close to true values despite the correlations. However, 
recent work has highlighted that FDR estimates may be strongly 
biased and have very high variance in the presence of certain cor- 
relation patterns (Schwartzman and Lin, 2011). We therefore feel 
that the potential advantages of FDR methods over FWER meth- 
ods in the presence of correlated tests warrant examination in the 
context of rare variant analysis. 

Additionally, the use of stratified FDRs may enable further 
increases in power to detect associations. A stratified FDR analysis 
implies that the test statistics can be divided into subsets or strata 
with varying probability that the null hypothesis is true (Sun et al., 
2006; Greenwood et al., 2007). Strata should be defined based on 
information that is external to the current study: for example, 
biologic plausibility or previously identified associations could be 
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used to define strata. Estimates of FDR within each stratum can 
then be combined, and if there truly is variability in the propor- 
tion of null tests across strata, the sensitivity and specificity to 
detect true associations can be enhanced through the stratified 
analysis. 

RATIONALE 

We therefore undertook this investigation to examine the power 
of window-based association tests, and how power is affected by 
analytic approaches that depend on external information such 
as genomic annotation. Due to the novelty of these window- 
based tests as well as the number of possible ways to implement 
them, analysts are likely to be particularly tempted to run multi- 
ple analyses with different choices for weights or subsets. Hence, 
we specifically wanted to know whether FDR estimation could be 
a beneficial approach to controlling the number of false-positive 
associations while simultaneously increasing power. 

OBJECTIVES 

Based on our rationale, our goal is to answer the following 
questions: 

- For window-based tests of rare genetic variation, can impor- 
tant power gains be achieved by considering external annota- 
tion information, either by weighting or by implementing a 
stratified FDR approach? 

- Can FDR be accurately estimated for window-based tests of rare 
genetic variation? 

To answer these questions, genetic sequencing data from a pre- 
liminary release of the UK10K project (www.uklOk.org), together 
with external information on amino acid alterations are the foun- 
dation for a simulation study comparing analytic strategies and 
their performance. 

METHODS 
DATA SETS 

Our analyses are based on simulated phenotypes together with 
genetic sequencing data from the UK10K project. 

Sequencing data 

The UK10K project is undertaking whole genome sequencing 
and analysis of approximately 10,000 individuals from the UK 
with the goal of understanding the contribution of rare genetic 
variation to common traits and diseases (www.ukl0k.org). For 
region-based analysis of rare variants in this consortium, an ini- 
tial analysis plan defined regions so that they contain 50 rare 
variants, where "rare" is either MAF <0.01 or MAF <0.05. 
Adjacent regions were chosen to overlap by 25 rare variants. To 
study correlation patterns, we used a portion of this sequencing 
data, specifically chromosome 3 sequencing data from an interim 
release, including 2432 individuals and 2,577,674 genetic variants. 
For an MAF threshold of 0.01 we defined 74,156 analysis windows 
or regions, derived from 1,853,923 rare variants at this threshold. 

PolyPhen-2 scores, which predict the impact of amino acid 
changes on protein structure and function, were obtained for each 
exonic variant (Adzhubei et al, 2010). Among SNVs with MAF 



<0.01 on chromosome 3 in the UK10K data, there were 3296 
variants with PolyPhen-2 scores indicating a "benign" alteration, 
1219 variants coded as "possibly damaging," and 2419 coded as 
"probably damaging." 

Simulated phenotype data — design 

In order to achieve our objective of studying the performance of 
FDR estimation, it was necessary to implement a two-level sim- 
ulation design. At the first level, we randomly selected a set of 
genes (and variants within genes) to influence the phenotype, 
and then using these causal variants, we simulated a continu- 
ous phenotype value for each individual in our data set. This 
step allowed us to calculate 74,156 region-based tests of associa- 
tion across chromosome 3, and thus obtain one estimate of FDR. 
Therefore the second level of the simulation design repeats this 
process 100 times, so that we can describe the variability in the 
FDR estimates and in the distribution ofp-values. In addition, we 
repeated the entire process with three models varying the strength 
of the genetic influences. 

Simulated phenotype data — details 

Within each simulation, we randomly selected 40 genes to be 
causally related to the phenotype from the 1116 genes on the 
chromosome. Rare variants in and near these genes were then 
randomly sampled to be the influential genetic variants. We 
defined, for the purposes of selecting causal variants associated 
with each gene, all variants in the adjacent intergenic intervals 
as potentially associated with the selected gene. We excluded 
singleton variants from consideration since they provide little 
additional benefit to power in tests of rare genetic variation 
(Ladouceur etal., 2013). 

The risk of a variant being causal, and its effect on a contin- 
uous phenotype, were generated as a function of the PolyPhen-2 
score values. The phenotypes y;, i = 1, . . ., n, for n individ- 
uals, were simulated from the model y, = ^ Pj Xy + €; where 
€, ~ N(0, a) represents random variability in the ith person's 
phenotype. The genotype covariates, xu, represent the number of 
minor alleles for individual i at causal variant j, and the effects 
of the causal variants, Pj, were randomly generated following the 
parameters in Table 1. Three different values were chosen for 
the standard deviation associated with the random variability, 
a 6 (0.5, 1.0, 1.5). 

For each value of 0, 100 sets of causal genes and variants 
were randomly selected, and correspondingly, 100 different phe- 
notypes were generated for each person. 

DATA ANALYSES 

In general, the analysis plan involved region-based tests of asso- 
ciation for all defined windows on chromosome 3, followed by 
examination of the joint distribution of the resulting p-values in 
order to estimate sensitivity and FDR. These two phases were 
each undertaken using several strategies to account for external 
annotation information about the genetic variants. 

Region-based tests of association using SKAT 

For each set of phenotypes across all individuals, we performed 
region-based SKAT tests of association (Wu et al, 2011) between 
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Table 1 | Parameter values for simulating phenotypes, dependent on 
PolyPhen-2 scores. 

PolyPhen-2 Probability Mean of Normally Standard deviation 
score that variant distributed effect normally 

is causal on phenotype distributed effect 

on phenotype 



Benign 


0.05 


1.00 


0.5 


Possibly 


0.35 


1.65 


0.5 


damaging 








Probably 


0.45 


2.00 


0.5 


damaging 








Missing 


0.001 


1.20 


0.5 



Background random variability in phsenotype was normally generated with mean 
0 and standard deviation between 0.5 and 1.5. 

the generated phenotype set and all 74,156 partially-overlapping 
regions of 50 rare variants (MAF <0.01) on chromosome 3. 
SKAT is designed to test association between a set of genetic vari- 
ants, such as those identified by sequencing methods in a small 
genomic window, and a phenotype or trait. The regions or win- 
dows for analysis must be chosen by the analyst, and may reflect 
gene boundaries or may be simply an arbitrary partitioning of 
the data on each chromosome. For a continuous phenotype, the 
SKAT test statistic can be written 

Q = (y-iL)'K(y-\i), 

where y is a vector of phenotype values, ft is the predicted mean 
of y under null hypothesis, and K is the SKAT kernel matrix, 
which depends on the genotype matrix and a choice of vari- 
ant weights. Under the null hypothesis, the distribution of Q is 
asymptotically equal to a positive quadratic form of standard nor- 
mal distributions, and the p-value can be calculated using Davies 
exact method or other approximation methods (Wu et al., 2011). 

Implementation of SKAT analyses on chromosome 3, incorporating 
annotation information 

The SKAT window-based analyses were performed in three dif- 
ferent ways to give different priority to variants with genomic 
annotations. 

(1) Analysis type "N": We used the default SKAT weighting of 
variants which ignores any annotation information and we 
included only variants with MAF <0.01. Here the weights 
were defined to give more weight to variants with smaller 
minor allele frequencies following a Beta distribution with 
parameters a\ = 1 and a.2 = 25 (Wu et al., 2011). 

(2) Analysis type "P": Here we weighted each variant also by 
PolyPhen-2 scores. For this purpose, we assigned weights 
of 0.5 for unknown or missing PolyPhen-2 scores, 0.25 for 
benign variation, 0.75 for possibly damaging, and 0.85 for 
probably damaging scores. Note, this analysis assumed that 
variants known to involve benign changes to the protein 
were given lower weights than variants where the change had 
unknown impact. 



(3) Analysis type "S": Here we analyzed only the subset of 
rare variants in a window that had non-missing PolyPhen- 
2 scores. That is, variants were included in the analysis if they 
had been assessed as benign, possibly damaging or proba- 
bly damaging. Note that the window boundaries were not 
changed, so for this strategy there could be very few variants 
jointly analyzed in a window, and many windows (espe- 
cially those outside genes) contained no annotated variants 
and were not analyzed with strategy S. The default SKAT 
weighting of variants was also used here. 

Defining strata and analysis strategies for FDR estimation 

After analysis, region-based test results were summarized across 
all windows (All), windows that contained no variants with 
damaging PolyPhen-2 scores (Stratum 1), and windows that con- 
tained at least one variant with a PolyPhen-2 score of probably 
or possibly damaging (Stratum 2). Note, therefore, that analysis 
type "S" of Stratum 2 (S-Stratum2) describes windows contain- 
ing only damaging variants (possibly or probably), S-Stratuml 
describes windows containing non-annotated or benign variants, 
and that "S-AU" are the windows with any kind of variants. 
Furthermore, we categorized windows into those truly contain- 
ing at least one causal variant (HI), and windows where at 
least one variant was strongly correlated with a causal SNP at 
either r = 0.9 (Hl-Corr0.90) or r > 0.75 (Hl-Corr0.75). All the 
nomenclature is presented in Table 2. By considering the strata, 
as well as the treatment of annotated variants within the test 
statistics, many analysis strategies were defined. For example, "S- 
Stratum2" is the analysis of only the damaging variants (and 
the windows in which they occur); whereas "N-AH" implies 
an analysis of all windows and all variants, with default SKAT 
weights. 

"True " Sensitivity and FDR 

Within the set of tests arising from each analytic strategy, we cal- 
culated the true sensitivity and the true value of the FDR for 
several chosen p-value thresholds. Let p w represent the p-value 
for window w using a particular testing strategy, and let a be a 
p-value threshold defining whether the null hypothesis is rejected. 
With some abuse of notation, let HI w represent the logical event 
that the window w truly contains at least one causal variant. We 
can therefore define "true" sensitivity as 



tSENS : 



ZZ=i[l{Hl w )ni(p w <a)] 



Ew=i'(«l) 



where the sum is across all w = 1 , . . . W windows tested. 
Similarly the "true" FDR can be written 



tFDR: 



E;=i[J(?w <«)nJ(°m w )] 



where I(°Hl w ) indicates that a window does not contain a causal 
variant. These sensitivities and FDRs were averaged across the 100 
simulations for each threshold a. 
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Table 2 | Nomenclature used to describe combinations of analytic 
strategies and subsets of regions analyzed. 



Description 



Nomenclature 



SKAT tests without any weighting based on N 

PolyPhen-2 scores, all rare variants with MAF <0.01 

SKAT tests with weighting based on PolyPhen-2 P 

scores, all rare variants with MAF <0.01 

SKAT tests only on rare variants (MAF <0.01 ) with S 

benign, possibly damaging, or probably damaging 

PolyPhen-2 scores 

All windows All 

Windows that contained only variants with either Stratum 1 or St1 

missing or benign PolyPhen2 scores 

Windows that contained at least one variant with a Stratum 2 or St2 
possibly or probably damaging PolyPhen-2 score 
A stratified analysis that combines results from Strat or Str 

Stratum 1 and Stratum 2 

Windows containing at least one causal variant H1 
Windows containing at least one variant correlated H1-Corr0.90 
with a causal SNP at r > 0.9 

Windows containing at least one variant correlated H1-Corr0.75 
with a causal SNP at r > 0.75 

Standard error associated with random error or noise a 



Estimation of FDR 

We estimated FDRs using three different methods, and com- 
pared the estimates to the true values. The three FDR estimation 
methods that we used are the Benjamini and Hochberg (BH) 
correction of p-values (Benjamini and Hochberg, 1995), the 
beta-uniform model-based parametric approach (BUM) (Pounds 
and Morris, 2003), and the modified Grenander density estima- 
tor based semi-pararametric approach (software called fdrtool) 
(Strimmer, 2008), in each case we estimate the tail area-based 
FDR. The BH method is a step-up p-value adjustment which con- 
trols FDR at level x. Let all the p-values for a given testing strategy 
be ordered from smallest to largest, p(i) , p@) , . . . , p( w ) , . . . , p(y/) ■ 
The Benjamini-Hochberg procedure finds w* , the largest value 
of w, such that p( w *) < t^t, and rejects the null hypothesis for 
all tests with p-values smaller than or equal to p( w *)- Pounds and 
Morris (2003) use the fact that p-values under the null hypothe- 
sis are expected to follow a uniform distribution, and therefore, 
they assume that the distribution of all p-values follows a mixture 
distribution where the uniform is mixed with a Beta distribution, 
denoted/i, 



/( P ) = 7t + (l-7t)/l(p). 

Here, it represents the proportion of test statistics that follow 
the null hypothesis. The estimated FDR for p-value threshold a 
is then the proportion of rejected tests estimated to follow the 
uniform distribution, and can be written 



Where 7t H j, represents an estimated upper bound on Tt, and F(.) 
is the estimated mixture distribution. Instead of assuming a uni- 
form distribution for p-values under the null hypothesis, the 
method of Strimmer (2008) finds an empirical null distribu- 
tion by using a smoothing approach. The mixture distribution is 
written more generally as 

/(p) = no/o(p) + (i-Tio)/i(p). 

Strimmer writes FDR at a given p-value p; as 



FDR s (p,) = tio 



1-Fq(P,; 9) 
1 - F(pi) 



FDRbum (°0 



Hub® 



F(ot) - (1 - a) Tiub + n ub a 



where F(.) and Fb(-) represent the distribution functions for 
the mixture model and the null distribution, respectively. A 
Grenander density estimator is used to obtain a nonparametric 
estimate of the distribution F and a truncated maximum likeli- 
hood is used to estimate the null density. Therefore, these three 
methods encompass very different approaches to the estimation 
of FDRs. 

An approach conceptually similar to Strimmer's was taken 
by Efron (Efron and Tibshirani, 1998; Efron, 2007) and imple- 
mented in the program locfdr (http://cran.r-project.org/web/ 
packages/locfdr/index.html), but an empirical smoother of the 
histogram of test statistics was used instead of the Grenander 
function. We were unable to obtain reasonable results with this 
method and they are not shown. 

Stratified FDR estimation 

To implement stratified FDR estimation, the FDR was estimated 
separately in each stratum. For the combined analysis, a desired 
FDR threshold was chosen and applied separately to the results 
for each stratum. This induces different p-value cutoffs and dif- 
ferent sensitivity rates across the strata, which are then combined 
to obtain overall estimates of sensitivity and power. 

Single point analyses 

For comparison, results from a small number of simulations were 
analyzed using single-marker tests of association. Each SNV along 
chromosome 3 was tested for association with the phenotype, 
using an additive (0, 1, 2) coding for the number of minor alleles 
at the genetic variant. 

RESULTS 

SIMULATED PHENOTYPE DATA 

For each simulation, the phenotypes were designed to depend on 
genotypes at a randomly selected set of 40 genes. Then within 
the chosen set of genes, a variable number of causal variants 
were sampled. Table 3 displays the mean number of causal vari- 
ants selected by our two-stage phenotype simulation design, both 
for all variants and among the subset of variants with PolyPhen- 
2 scores. It is evident from Table 3 that almost 50% of the 
causal variants had PolyPhen-2 annotations, and therefore that, 
as desired, the proportion of annotated variants that are causal is 
much higher than the proportion of non-annotated causal vari- 
ants. Also the number of windows that contained at least one 
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Table 3 | Number of causal variants generated in the simulations. 

Category Number of Number of windows containing at Number of windows containing at least 

causal variants least one causal variant one variant correlated with a causal variant at 

r > 0.9 r > 0.75 



All variants 
Variants in Stratum 1 
Variants in Stratum 2 
Variants with 
PolyPhen-2 scores 



21.89 (4.57) 
10.61 (3.18) 
11.28 (2.75) 
12.46 (3.12) 



41.02 (8.26) 
20.96 (6.30) 
20.96 (4.89) 
22.81 (5.25) 



1174.78 (1037.12) 
443.65 (726.19) 
758.79 (786.22) 
845.97 (826.83) 



1709.08 (1442.99) 
779.94 (1106.35) 
970.75 (995.43) 

1190.50 (1110.22) 



Each simulation first selected 40 genes and then selected variants to be causal in and/or near these genes. Mean (and standard deviation) are reported across the 
simulations. 



causal variant is very equally split between Stratum 1 and 2. 
Table 3 also shows that only a small number of analysis windows 
contained at least one causal variant. However, the number of 
windows that contain variants strongly correlated with a causal 
SNV is much higher. It is worth noting that although 40 genes 
were chosen to influence phenotype, it is possible that no causal 
variants were randomly selected for some of these genes. In fact 
the number of genes containing at least one causal variant varied 
between 8 and 22 across the simulations. 

REGION-BASED TESTS OF ASSOCIATION: DISTRIBUTIONS OF 
p-VALUES 

Figure 1 shows QQ-plots of p-values for 3 simulations. The 3 
columns in Figure 1 correspond to 3 different sets of simu- 
lated phenotypes, and the rows correspond to different analytic 
strategies. Region-based test results (rows 1-3) are contrasted 
with single-SNP tests (rows 4-5). When a = 0.5, the power to 
detect association at the window level can be extremely sub- 
stantial and the QQ-plots may deviate markedly from the line 
of expectation. For example, the smallest p-values for simula- 
tion #3 reach 10~ 80 . It is also notable that the most signifi- 
cant test statistics can vary by almost an order of magnitude 
across different simulations. In contrast, when cr = 1.5, there is 
very little power to detect association with the genetic variation 
in the windows, at a sample size of 2432 individuals, for any 
method. Single SNP results can have more power than region- 
based tests, as in simulation #1, or less power, as in simulations 
2 and 3. 

Figure 2 shows histograms of the p-values obtained for all 
region-based tests of association for three different simulations, 
for all tests, and also separately for Stratum 1 (no damaging causal 
variants in the windows) and Stratum 2 (at least one damaging 
causal variant in the window). In each histogram there is a vis- 
ible peak of small p-values, indicating that a subset of the tests 
deviate markedly from the null hypothesis but there is also an 
apparent peak of p-values at or near the value of 1.0. This lat- 
ter spike is more visible when the background random variability 
is at its smallest (a = 0.5), and particularly dramatic for the "S" 
analytic strategy where only annotated variants were analyzed. 
Therefore, we believe that this spike is likely due to violations of 
the asymptotic convergence of the test statistics. For the S strat- 
egy, the data sets of analyzed variants could be very sparse — a 
much smaller number of windows were analyzed, and within 



those windows there may have been little genetic variability. Since 
there was a small number of damaging variants, a window may 
contain only one or two analyzable variants. Furthermore, only a 
few individuals may carry alternative alleles due to the rarity of 
the high risk alleles. When a = 0.5, individuals carrying one of 
these causal variants would appear to have extreme phenotypes 
(large outliers) and this may compromise the convergence of the 
test statistics. 

The tests in Stratum 2 are a very small minority of all tests 
for the N and P analysis strategies, but a large proportion of all 
tests for the S strategy. This can be seen in Figure 2 in the bottom 
row, where the line representing Stratum 2 is higher than Stratum 
1. Visually, it is not very easy to distinguish, in any row of this 
figure, whether one stratum contains a higher proportion of small 
p-values than the other. 

TRUE SENSITIVITY AND FDR FOR DIFFERENT ANALYSIS STRATEGIES 

The capacity to identify windows containing causal variants is 
shown in Table 4 and Figure 3. For a given p-value threshold 
for significance, the sensitivity is the proportion, of the windows 
where the null hypothesis was rejected, that actually contain at 
least one true causal variant. For a p-value threshold of le-05, 
the performance of five different analytic strategies are presented 
in Table 4. Firstly all windows are analyzed with the default 
SKAT weights based on MAF (N-All). We then separate windows 
containing benign or non-annotated variants from windows con- 
taining at least one damaging variant and analyzed these two sets 
separately with standard SKAT (N-Stratuml and N-Stratum 2). 
For Stratum 2 only, we also report analyses either with weights 
(P-Stratum 2) or restricting to the subset of annotated variants (S- 
Stratum 2) using PolyPhen-2 weights (P-Stratum 2). Results are 
also shown for all windows, for windows containing at least one 
causal variant, and for windows correlated with a causal window 
(see Table 2 for nomenclature details). Figure 3 contains some 
additional sensitivity results for Stratum 1 tests for 0 = 0.5. For 
complete results, see Tables S1A-D, 3A-D as well as Figure 4. 

Examining strategy HI, where the focus is only on windows 
containing causal variants, it can be seen that the sensitivities 
tend to be very low for any analytic strategy, and they do not 
increase much as the p-value threshold becomes less stringent. 
This conclusion appears to contradict the highly deviant QQ-plot 
results seen in Figure 1, in particular when a = 0.5. It appears, 
therefore, that some windows containing true causal variants can 
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FIGURE 1 | QQ-plots of p-values for analysis of all chromosome 
3 windows, for 3 different simulated phenotypes that were 
selected to illustrate performance differences. Values of a are 
distinguished by color. Region-based testing using three analysis 



strategies (N-AII, P-AII, and S-AII) are shown in the first three 
rows. The bottom two rows show the results of single-SNP 
analyses, either with all SNPS (row 4) or only annotated SNPs 
(row 5). 
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FIGURE 2 1 Distributions of p-values across 100 simulations, for 
window-based tests of variation at all windows on chromosome 
3 (gray histogram), for Stratum 1 (blue), and Stratum 2 (red). 

The first row shows the standard SKAT weighting of variants (N), 



the second row shows analyses weighted by PolyPhen-2 scores (P), 
and the bottom row shows analyses of only the subset of 
damaging variants (S). Columns 1 to 3 are o = 0.5, 1.0 and 1.5, 
respectively. 



be detected with ease (with high power), but many others appear 
to be indistinguishable from the windows containing no causal 
variants. The best sensitivity for windows truly containing at least 
one causal variant (HI), with a liberal p-value threshold of le-3 
and for the S-Stratum 2 analytic strategy, reaches only 35%. 

However, when we relax our definition of sensitivity, the results 
improve. Our more inclusive definition calls a "causal window" as 
a window containing at least one variant in strong linkage dise- 
quilibrium with a causal variant, where linkage disequilibrium is 
defined using r > 0.9 (Hl-Corr0.90) or r > 0.75 (Hl-Corr0.75). 
Sensitivity then measures the proportion of this larger set of win- 
dows with p-values below the chosen threshold. Table 4 shows 
that sensitivity increases substantially for all analytic strategies, 
and dramatically for N-All and Stratum 1. Using this liberal defi- 
nition of a causal window based on correlation of 0.75, sensitivity 
reaches 48% with a = 0.5, a p-value threshold of le-05, and either 
N-All or S-Stratum2 (Table 4). 

The "true" FDR, or the proportion of rejected tests where the 
window contained no causal variants, is also shown in Table 4. It 



is evident from the very high tFDR values that most of the sig- 
nificant tests correspond to windows that did not contain any 
causal variants, even when we relax or enlarge our definition 
of success to include windows that are strongly correlated with 
those containing causal variants. More focused subsets of win- 
dows, obtained for example by the S-Stratum2 analyses, have 
smaller tFDR estimates, but they are still greater than 50% even 
when a p-value of le-08 is used as the threshold (Table S3A). 
These results demonstrate clearly that it is extremely difficult to 
segregate causal and non-causal windows. 

ESTIMATION OF FDR 

Using three methods for estimating FDR, we then examined the 
proportion of falsely-rejected hypotheses across methods, ana- 
lytic strategies, and with the true values. Table 5 shows, of the tests 
with FDR estimates less than 0.05 (i.e., an FDR threshold of 0.05), 
what proportion is truly null. Tables S2A-I, S4A-I show complete 
results for different analytic strategies and three different FDR 
thresholds. 
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Table 4 | Sensitivity (tSENS) for detection of windows containing causal variants with different analysis strategies, for a p-value threshold of 
1e-05. 



Analysis Sensitivity (tSENS), i.e., the proportion of Proportion of windows where null is falsely 
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FIGURE 3 | Sensitivity (tSENS) at different significance thresholds 
(p-values) and for various analytic strategies. The horizontal axis is 
— Iog!o (p), and the vertical axis is the true sensitivity for detecting windows 
containing at least one associated causal variant, when o = 0.5. N-AII: All 
tests, standard weighting; N-St2: Stratum 2, standard weighting; P-AII: All 
tests, PolyPhen-2 weights; P-St2: Stratum 2, PolyPhen-2 weights; S-AII: 
All tests; subset of damaging variants; S-St2: Stratum 2, subset of 
damaging variants. 



Focusing on only windows that truly contain causal variants 
(HI), Table 5 indicates that over 90% of the rejected tests are actu- 
ally null across a variety of analytic strategies, in stark contrast to 
the estimated FDR values of 0.05. All three methods give very sim- 
ilar results. In fact, the three methods identify the same windows 
as significant: the true null proportions are very similar across the 
methods. 



By expanding the definition of a "true" association to include 
windows that are correlated with causal windows, there is some 
improvement. The proportion of null rejections drops from well 
over 90% to as low as 40-50% when using the BUM method, 
and in the Stratum 2 subset analyses, but these values are 
still far higher than the estimated FDR of 0.05. When using a 
slightly more liberal definition of a true association (r > 0.75; 
Tables S4A-I) the proportion of null rejection falls further. When 
examining other FDR thresholds (Tables S2/4B, S2/4E, S2/4H 
show FDR = 0.25; Tables S2/4C, S2/4F, S2/4I show FDR = 0.50) 
the estimated FDR values are even closer to 1.0. 

STRATIFIED FDR 

A stratified analysis strategy allows for different p-value thresh- 
olds to be applied in different strata. This reflects variability in 
the estimated proportions of truly null hypotheses across strata. 
The results of stratified FDR analysis are also shown in Table 5 
and in the Supplementary Tables. For a chosen FDR threshold 
(here FDR = 0.05), different p-value thresholds are applied in the 
two strata. However, we see no benefit in terms of the number 
of falsely-associated windows or sensitivities associated with the 
stratified analysis. 

DISCUSSION 

In this paper, we have explored the potential of using FDRs 
together with genomic annotation to improve the ability to detect 
associations with rare genetic variants using window-based tests. 
Returning to our Objectives, we found that, as expected, using 
annotation information improved power, since this was built into 
our simulation design. However, power remained low. Also, we 
did not find that FDR estimation was a particularly useful tool 
in this context. The proportion of significant yet not-associated 
windows was very large, and the estimates of FDR were extremely 
biased. We discuss this bias below. 

We based our exploration on an interim release of sequenc- 
ing data from the UK10K project, including approximately 2.5 
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FIGURE 4 | Additional results for sensitivities (tSENS) and the p-value 
threshold. The horizontal axis is — logio(p), and the vertical axis is the true 
sensitivity for detecting windows containing at least one associated causal 
variant. Solid lines: Analytic strategy "N'.' Dashed lines: Analytic strategy "P 



Dotted lines: Analytic strategy "S." Three different colors represent All-tests 
(black), Stratum 1 (blue), and Stratum 2 (red), respectively. The three rows 
show results for a = 0.5, 1.0, and 1.5, respectively. The three columns contain 
results for H1, H1-Corr0.90, and H1-Corr0.75, respectively. 



million sequence-identified variants on chromosome 3 among 
2432 individuals, and implemented a fairly complex simula- 
tion design. We assumed that there were as many as 40 genes 
on this chromosome with influence on a continuous trait. We 
randomly selected 40 genes from those on the chromosome, 
and then we randomly selected causal genetic variants from all 
genetic variants in or near these genes with probabilities that 
depended on the real PolyPhen-2 scores for the genetic vari- 
ation. As an alternate strategy, we could have fixed the genes 



and variants selected to be causal, and simply varied their 
effect on phenotype across the simulations. However, our cho- 
sen approach incorporates additional variability in the pattern 
of associated variants and their correlations, since we wanted 
to examine the performance of FDR estimation under a vari- 
ety of conditions. Also, for reasons of feasibility, we used pilot 
data on only one chromosome. Patterns of gene density and 
correlations may vary across chromosomes, but our simula- 
tion design hopefully includes enough randomness that results, 



Frontiers in Genetics | Statistical Genetics and Methodology 



January 2014 | Volume 5 | Article 11 | 10 



Xu et al. 



False discovery rates and rare variants 



Table 5 | The performance of estimated FDR methods for window-based rare variant tests, using three different methods for estimation. 



Analytic Identification of windows containing causal Identification of windows either containing causal 

strategy variants variants, or correlated with such windows at r > 0.90 
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0.980 (0.042) 


0.980 (0.036) 


0.952 (0.049) 


0.944 (0.056) 


0.953 (0.049) 


P- 


-All 


0.974 (0.064) 


0.970 (0.107) 


0.975 (0.060) 


0.925 (0.096) 


0.911 (0.130) 


0.927 (0.091) 


P- 


-Strat 


0.971 (0.075) 


0.970 (0.073) 


0.972 (0.075) 


0.943 (0.078) 


0.932 (0.087) 


0.943 (0.078) 
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-All 


0.927 (0.038) 


0.927 (0.037) 


0.920 (0.040) 


0.874 (0.046) 


0.858 (0.046) 


0.862 (0.045) 


S- 


-Strat 


0.927 (0.037) 


0.926 (0.038) 


0.920 (0.040) 


0.882 (0.045) 


0.868 (0.045) 


0.872 (0.046) 



Results show the proportion of windows containing no causal variants (tFDR) for an estimated FDR of 0.05. Results are shown for the simulation with a = 0.5. 
Results are the mean across 100 simulations (standard deviation). 



in general, would be applicable to larger regions or different 
chromosomes. 

The choice of as many as 40 causal genes located on the same 
chromosome was made for two reasons. Firstly, for many com- 
plex traits, it may be extremely likely that there are large numbers 
of genes each with a small influence on the trait; this has been 
suggested for height (Yang et al., 2010), for example, as well as 
several other continuous phenotypes. Secondly, in order to esti- 
mate FDRs using many of the existing methods, it is necessary to 
estimate the proportion of associated (non-null) tests. However, 
if this proportion is too close to zero, then estimation becomes 
extremely difficult. In fact, despite the choice of 40 causal genes, 
the number of windows containing a causal variant is still small 
(Table 3). 

Although the number of causal variants is small, the number 
of windows potentially showing association could be much larger 
due to patterns of linkage disequilibrium leading to extensive cor- 
relations. We therefore implemented a more relaxed definition 
of successfully identifying a true signal: if a window showing a 
significant result contained at least one genetic variant strongly 
correlated with a causal variant (using either r > 0.90 or r > 
0.75), then we counted this as a true identification. Sensitivity 
increased quite substantially with the relaxed definition, and in 
some models reached 50-60%. If we had used a lower level of 
correlation when defining a "true positive" region identification, 
we would undoubtedly have been able to improve our sensitiv- 
ity further. In fact, the correlation, r, is not ideal as a measure 
of the strength of linkage disequilibrium between variants, espe- 
cially for rare genetic variants. A more nuanced consideration 
of haplotypic structure could provide an interesting perspective 
and may lead to improved sensitivities. Our relaxed definition 
of success also raises questions about how to perform fine map- 
ping, since "true positive" windows could be quite genetically 
distant from any causal variants, if there was long-range dis- 
equilibrium. Of course, many studies of real phenotypes have 
identified associations that are located far from any likely gene. 

One of our most striking findings was the discrepancy between 
estimated FDR values (e.g., FDR = 0.05) and the true FDR based 
on the proportion of windows with small p-values that either 
contain a causal variant or are strongly associated with causal 



Table 6 | Setup for calculation of sensitivity, power, and false positive 
rates. 


Region-based test 
result 


Truth 

True Ha 


True Ho Total 


p-value < Threshold: 
reject null hypothesis 


A 


C A + C 


p-value > Threshold: do 
not reject null hypothesis 


B 


D B + D 


Total 


A + B 


C + D 



A B, C, D represent the numbers of tests that fall into the four cells defined by 
the p-value (whether or not the null hypothesis is rejected) and whether (Ha) or 
not (Hq) the region is a causal region. 



variants. Three factors play into this, power, the proportion of 
all tests that are null (tt), and the correlations between windows. 
Table 6 shows a standard 2x2 table setup for calculating sen- 
sitivity, specificity and false positive rates, where FDR = A + c ■ 
Poor power leads to values of A that are too small. Furthermore, 
if C + D is a very large number (tt is large) then it becomes easy 
for C to be much larger than A. Finally, the linkage disequilib- 
rium structure leads to complex patterns of dependence; signals 
resulting from causal genetic variants may be detected in win- 
dows some distance away. Hence, the choice of definition for a 
"causal" window influences the number of tests placed into the 
two columns. 

The lack of power for detection of many of the causally- 
associated regions may have been due to very small MAFs, to the 
presence of only a very small number of causal variants in each 
window, or even to the fact that with 41 causal variants on average 
(Table 3), that the separate signals would be difficult to distin- 
guish. Better power may be obtainable by iteratively correcting 
the phenotype for each consecutively-identified variant or sig- 
nal, and re-running analyses on the residuals, although it would 
be tricky to decide exactly how to implement such a strategy for 
window-based testing. 
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If window-based tests perform poorly when there are only a 
few causal variants, perhaps single-SNP tests have more power. 
Therefore, in a few selected cases, we compared window-based 
results to SNP results (Figure 1), and in fact we concluded that 
there is not necessarily one strategy that will be more powerful. 
The relative performance will depend on the density of causal 
variants in small genomic regions, and whether several causal 
variants occur within the same window. This comparison is not 
the primary focus of this paper, but could be an interesting 
question for future research. 

We compared three methods for estimation of FDRs. The BH 
method (Benjamini and Hochberg, 1995) implements a step- 
up strategy adjusting each p-value in turn, and provides an 
upper bound for the FDR under certain patterns of depen- 
dency (Benjamini and Yekutieli, 2001). This method does not 
require an estimate of the proportion of the tests that are truly 
null. In contrast, we implemented two methods that do estimate 
this proportion, one parametric (Pounds and Morris, 2003) and 
one semi-parametric (Strimmer, 2008). Despite the very differ- 
ent model assumptions, in fact the resulting estimates of FDR 
were very comparable across the three methods. However, none 
of the estimates were at all close to the true FDR values. All 
methods were probably being misled by associations seen at cor- 
related variants, located in nearby windows. Performance for the 
model-based methods may also have been adversely affected by 
the observed spike in the number of p-values of 1.0 (Figure 2). 
The presence of this spike violates the assumption of a uniform 
distribution of p-values under the null hypothesis. Finally, it is 
plausible that the differences between the approaches would be 
more visible for smaller FDR estimates. 

Annotation of the genome, including regulatory regions as 
well as genomic conservation, is improving daily (Maher, 2012). 
It seems intuitive that such information should aid in identify- 
ing associations between genetic variability and phenotypes. We 
simulated data where PolyPhen-2 scores influenced not only the 
probability that a variant was causal, but also the magnitude 
of the effect of the variant on the phenotype. It was therefore 
not surprising to find that use of weighted test statistics, where 
weights were derived from the PolyPhen-2 scores, improved the 
sensitivity and power. Similarly, analyses of only the subset of 
possibly- or probably-damaging annotated variants performed 
better than analysis of all variants, as expected. The differences 
in power, however, were quite small and smaller than we had 
previously anticipated. Although we could have prepared a more 
complex set of relationships between genomic annotation infor- 
mation and our simulated phenotypes, perhaps using multiple 
different annotation measures, we do not feel that our primary 
conclusions here would be altered, given the same analysis strate- 
gies. Power of the tests will still be largely driven by the MAFs and 
density of the causal variants within the windows, as well as the 
magnitude of their effects. An interesting point to consider here is 
that when small p-values are seen in correlated windows — that do 
not contain any causal variants but are in linkage disequilibrium 
with causal variants — the windows with the smallest p-values 
may be less likely to contain annotated variants due to MAF 
variation. Annotation information may therefore be useful for 
fine-mapping. 



Another option for coping with multiple testing would be to 
develop a Bayesian model for the strength of each test statistic's 
association, where informative prior distributions are assumed 
for the parameters measuring the strength of association between 
genomic regions and phenotype. This approach could be con- 
sidered conceptually as an extension of stratified FDR to the 
case where each test statistic has its own stratum, defined by the 
genomic annotation information for all variants in each region. 
The result of such an analysis would be a (posterior) probabil- 
ity that the genetic variation in a chosen region is associated 
with the phenotype. Regions with annotations likely to contain 
causal genetic variation would have higher prior probabilities 
of association with phenotype, and hence also higher posterior 
probabilities of association. This avenue may be worth further 
consideration and exploration. 

Prior to undertaking this simulation study, our hypothesis was 
that use of stratified FDR methods, using genomic annotation 
information, could lead to improved power to detect associations 
with rare genetic variation. We did not find any improved per- 
formance using stratified FDR methods in this context. However 
in another context where the strata may more clearly delineate 
the probability of a true association, or if informative prior infor- 
mation could be constructed effectively, then perhaps a return to 
consideration of these issues would be warranted. 

SUPPLEMENTARY MATERIAL 

The Supplementary Material for this article can be found online 
at: http://www.frontiersin.org/journal/10.3389/fgene.2014. 
00011 /abstract 

Table SI (A-D) | True sensitivity (tSENS) and true FDR (tFDR) for different 
analytic strategies. Each table shows the tSENS and tFDR values for 
different p-value thresholds, ranging from 1 e-08 (Table S1A) to 1 e-03 
(Table S1D). "m" is the mean over the 10 simulations, and "sd" is the 
standard deviation. Nomenclature follows Table 2. 
Table S2 (A-l) | Sensitivities and proportion of truly null window tests 
(tFDR) for different estimated values of FDR. Each table shows the 
sensitivity and tFDR values associated with a different estimated FDR 
threshold between 0.05 (S2A) and 0.50 (S2I). Nomenclature follows 
Table 2. 

Table S3 (A-D) | True sensitivity (tSENS) and true FDR (tFDR) for different 
analytic strategies examining correlated windows with variants correlated 
at r > 0.75. Each table shows the tSENS and tFDR values for different 
p-value thresholds, ranging from 1e-08 (Table S1A) to 1e-03 (Table S1D). 
"m" is the mean over the 10 simulations, and "sd" is the standard 
deviation. Nomenclature follows Table 2. 

Table S4 (A-l) | Sensitivities and proportion of truly null window tests 
(tFDR) for different estimated values of FDR examining correlated 
windows with variants correlated at r > 0.75. Each table shows the 
sensitivity and tFDR values associated with a different estimated FDR 
threshold between 0.05 (S2A) and 0.50 (S2I). Nomenclature follows 
Table 2. 
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